Thera Bank recently saw a steep decline in the number of credit card users. Credit cards are a good source of income for banks because of the various fees they charge, such as annual fees, balance transfer fees, cash advance fees, late payment fees, and foreign transaction fees. Some fees are charged to every user irrespective of usage, while others are charged only under specified circumstances.
Customers leaving the credit card service leads to losses for the bank, so the bank wants to analyze customer data to identify which customers are likely to leave and why, so that it can improve in those areas.
As a Data Scientist at Thera Bank, you need to build a classification model that will help the bank improve its services so that customers do not renounce their credit cards.
Domain Knowledge
Customer Attrition: The core problem is customer churn (attrition) in the credit card sector. Understanding why customers leave is crucial for retention strategies.
Revenue Streams: Banks earn revenue from credit cards through various fees (annual, late payment, balance transfer, etc.). The balance between usage-based and fixed fees is important.
Customer Behavior: The data captures a mix of demographics (age, education, income), relationship metrics (time with bank, products held), and credit card usage patterns (spending, balance, inactivity).
Potential Hypotheses
We can categorize hypotheses to explore relationships between customer attributes and attrition:
Demographic Factors:
Hypothesis 1: Younger customers are more likely to churn due to changing financial priorities.
Hypothesis 2: Customers with higher education levels might be less likely to churn due to better financial literacy.
Hypothesis 3: Higher income customers might be less sensitive to fees and less likely to churn.
Relationship Factors:
Hypothesis 4: Customers with longer relationships with the bank are less likely to churn.
Hypothesis 5: Customers with more products at the bank are more likely to stay due to higher engagement.
Hypothesis 6: Frequent contact with the bank (e.g., customer service) might reduce churn.
End goal
To build a model that can predict whether a customer is likely to become an "Attrited Customer" so that the bank can take proactive measures to retain them.
# Installing the libraries with the specified version.
# uncomment and run the following line if Google Colab is being used
# !pip install scikit-learn==1.2.2 seaborn==0.13.1 matplotlib==3.7.1 numpy==1.25.2 pandas==1.5.3 imbalanced-learn==0.10.1 xgboost==2.0.3 -q --
# Data manipulation
import numpy as np
import pandas as pd
import seaborn as sns
# Visualization
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
# remove warning errors
import warnings
warnings.filterwarnings('ignore')
# Scikit learn
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, confusion_matrix, classification_report
from sklearn.pipeline import Pipeline
# SMOTE
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
# To help with model building
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
AdaBoostClassifier,
GradientBoostingClassifier,
RandomForestClassifier,
BaggingClassifier,
)
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
# To suppress scientific notations
pd.set_option("display.float_format", lambda x: "%.3f" % x)
# Installing the libraries with the specified version.
# uncomment and run the following lines if Jupyter Notebook is being used
# !pip install scikit-learn==1.2.2 seaborn==0.13.1 matplotlib==3.7.1 numpy==1.25.2 pandas==1.5.3 imbalanced-learn==0.12.0 xgboost==2.0.3 -q --user
# !pip install --upgrade -q threadpoolctl
Note: After running the above cell, kindly restart the notebook kernel and run all cells sequentially from the start again.
df = pd.read_csv("/content/BankChurners.csv")
# Initial first rows
df.head()
| CLIENTNUM | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | ... | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 768805383 | Existing Customer | 45 | M | 3 | High School | Married | $60K - $80K | Blue | 39 | ... | 1 | 3 | 12691.000 | 777 | 11914.000 | 1.335 | 1144 | 42 | 1.625 | 0.061 |
| 1 | 818770008 | Existing Customer | 49 | F | 5 | Graduate | Single | Less than $40K | Blue | 44 | ... | 1 | 2 | 8256.000 | 864 | 7392.000 | 1.541 | 1291 | 33 | 3.714 | 0.105 |
| 2 | 713982108 | Existing Customer | 51 | M | 3 | Graduate | Married | $80K - $120K | Blue | 36 | ... | 1 | 0 | 3418.000 | 0 | 3418.000 | 2.594 | 1887 | 20 | 2.333 | 0.000 |
| 3 | 769911858 | Existing Customer | 40 | F | 4 | High School | NaN | Less than $40K | Blue | 34 | ... | 4 | 1 | 3313.000 | 2517 | 796.000 | 1.405 | 1171 | 20 | 2.333 | 0.760 |
| 4 | 709106358 | Existing Customer | 40 | M | 3 | Uneducated | Married | $60K - $80K | Blue | 21 | ... | 1 | 0 | 4716.000 | 0 | 4716.000 | 2.175 | 816 | 28 | 2.500 | 0.000 |
5 rows × 21 columns
# Last few rows
df.tail()
| CLIENTNUM | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | ... | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10122 | 772366833 | Existing Customer | 50 | M | 2 | Graduate | Single | $40K - $60K | Blue | 40 | ... | 2 | 3 | 4003.000 | 1851 | 2152.000 | 0.703 | 15476 | 117 | 0.857 | 0.462 |
| 10123 | 710638233 | Attrited Customer | 41 | M | 2 | NaN | Divorced | $40K - $60K | Blue | 25 | ... | 2 | 3 | 4277.000 | 2186 | 2091.000 | 0.804 | 8764 | 69 | 0.683 | 0.511 |
| 10124 | 716506083 | Attrited Customer | 44 | F | 1 | High School | Married | Less than $40K | Blue | 36 | ... | 3 | 4 | 5409.000 | 0 | 5409.000 | 0.819 | 10291 | 60 | 0.818 | 0.000 |
| 10125 | 717406983 | Attrited Customer | 30 | M | 2 | Graduate | NaN | $40K - $60K | Blue | 36 | ... | 3 | 3 | 5281.000 | 0 | 5281.000 | 0.535 | 8395 | 62 | 0.722 | 0.000 |
| 10126 | 714337233 | Attrited Customer | 43 | F | 2 | Graduate | Married | Less than $40K | Silver | 25 | ... | 2 | 4 | 10388.000 | 1961 | 8427.000 | 0.703 | 10294 | 61 | 0.649 | 0.189 |
5 rows × 21 columns
Insight
# Shape of dataset
df.shape
(10127, 21)
Total of 21 columns and 10,127 rows
# Data types
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 10127 entries, 0 to 10126 Data columns (total 21 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 CLIENTNUM 10127 non-null int64 1 Attrition_Flag 10127 non-null object 2 Customer_Age 10127 non-null int64 3 Gender 10127 non-null object 4 Dependent_count 10127 non-null int64 5 Education_Level 8608 non-null object 6 Marital_Status 9378 non-null object 7 Income_Category 10127 non-null object 8 Card_Category 10127 non-null object 9 Months_on_book 10127 non-null int64 10 Total_Relationship_Count 10127 non-null int64 11 Months_Inactive_12_mon 10127 non-null int64 12 Contacts_Count_12_mon 10127 non-null int64 13 Credit_Limit 10127 non-null float64 14 Total_Revolving_Bal 10127 non-null int64 15 Avg_Open_To_Buy 10127 non-null float64 16 Total_Amt_Chng_Q4_Q1 10127 non-null float64 17 Total_Trans_Amt 10127 non-null int64 18 Total_Trans_Ct 10127 non-null int64 19 Total_Ct_Chng_Q4_Q1 10127 non-null float64 20 Avg_Utilization_Ratio 10127 non-null float64 dtypes: float64(5), int64(10), object(6) memory usage: 1.6+ MB
Insight
Total of 21 columns with a mix of dtypes (float64: 5, int64: 10, object: 6) across 10,127 entries.
Missing values are present in Education_Level and Marital_Status and will need to be addressed.
CLIENTNUM appears to be unique for each customer, confirming it is a valid identifier that can be dropped.
Attrition_Flag, Gender, Education_Level, Marital_Status, Income_Category, and Card_Category are of type object, typically used for categorical data; these can be converted to the category dtype.
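The checks suggested above can be sketched as follows, using a toy frame in place of `df` (the values are illustrative only):

```python
import pandas as pd

# Toy stand-in for df; values are illustrative, not the real data
toy = pd.DataFrame({
    "CLIENTNUM": [768805383, 818770008, 713982108],
    "Attrition_Flag": ["Existing Customer", "Existing Customer", "Attrited Customer"],
    "Gender": ["M", "F", "M"],
})

# CLIENTNUM is a valid identifier only if every value is unique
assert toy["CLIENTNUM"].nunique() == len(toy)

# Convert the object columns to the memory-efficient category dtype
cat_cols = ["Attrition_Flag", "Gender"]
toy[cat_cols] = toy[cat_cols].astype("category")
print(toy.dtypes["Gender"])  # category
```

The same `astype("category")` call would apply to the full list of object columns in `df`.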
# Missing Values
df.isnull().sum()
CLIENTNUM 0 Attrition_Flag 0 Customer_Age 0 Gender 0 Dependent_count 0 Education_Level 1519 Marital_Status 749 Income_Category 0 Card_Category 0 Months_on_book 0 Total_Relationship_Count 0 Months_Inactive_12_mon 0 Contacts_Count_12_mon 0 Credit_Limit 0 Total_Revolving_Bal 0 Avg_Open_To_Buy 0 Total_Amt_Chng_Q4_Q1 0 Total_Trans_Amt 0 Total_Trans_Ct 0 Total_Ct_Chng_Q4_Q1 0 Avg_Utilization_Ratio 0 dtype: int64
round(df.isnull().sum() / df.isnull().count() * 100, 2)
CLIENTNUM 0.000 Attrition_Flag 0.000 Customer_Age 0.000 Gender 0.000 Dependent_count 0.000 Education_Level 15.000 Marital_Status 7.400 Income_Category 0.000 Card_Category 0.000 Months_on_book 0.000 Total_Relationship_Count 0.000 Months_Inactive_12_mon 0.000 Contacts_Count_12_mon 0.000 Credit_Limit 0.000 Total_Revolving_Bal 0.000 Avg_Open_To_Buy 0.000 Total_Amt_Chng_Q4_Q1 0.000 Total_Trans_Amt 0.000 Total_Trans_Ct 0.000 Total_Ct_Chng_Q4_Q1 0.000 Avg_Utilization_Ratio 0.000 dtype: float64
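Since Education_Level (15.0%) and Marital_Status (7.4%) are categorical, one reasonable option (a sketch, not necessarily the final treatment chosen here) is to make missingness its own explicit 'Unknown' level, mirroring how the 'abc' value in Income_Category is handled later:

```python
import pandas as pd

# Tiny stand-in frame; values are illustrative only
demo = pd.DataFrame({
    "Education_Level": ["Graduate", None, "High School", None],
    "Marital_Status": ["Married", "Single", None, "Married"],
})

# Fill missing categorical values with an explicit 'Unknown' level
for col in ["Education_Level", "Marital_Status"]:
    demo[col] = demo[col].fillna("Unknown")

print(demo.isnull().sum().sum())  # 0
```

This keeps all rows and lets the models treat "missing" as potentially informative, rather than discarding 15% of the data.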
# Duplicates
df.duplicated().sum()
0
Insight
No duplicate rows are present, and CLIENTNUM can be dropped as a pure identifier.
df.drop(["CLIENTNUM"], axis=1, inplace=True)
data = df.copy()
# Data summary
data.describe(include='all').T
| count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Attrition_Flag | 10127 | 2 | Existing Customer | 8500 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Customer_Age | 10127.000 | NaN | NaN | NaN | 46.326 | 8.017 | 26.000 | 41.000 | 46.000 | 52.000 | 73.000 |
| Gender | 10127 | 2 | F | 5358 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Dependent_count | 10127.000 | NaN | NaN | NaN | 2.346 | 1.299 | 0.000 | 1.000 | 2.000 | 3.000 | 5.000 |
| Education_Level | 8608 | 6 | Graduate | 3128 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Marital_Status | 9378 | 3 | Married | 4687 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Income_Category | 10127 | 6 | Less than $40K | 3561 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Card_Category | 10127 | 4 | Blue | 9436 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Months_on_book | 10127.000 | NaN | NaN | NaN | 35.928 | 7.986 | 13.000 | 31.000 | 36.000 | 40.000 | 56.000 |
| Total_Relationship_Count | 10127.000 | NaN | NaN | NaN | 3.813 | 1.554 | 1.000 | 3.000 | 4.000 | 5.000 | 6.000 |
| Months_Inactive_12_mon | 10127.000 | NaN | NaN | NaN | 2.341 | 1.011 | 0.000 | 2.000 | 2.000 | 3.000 | 6.000 |
| Contacts_Count_12_mon | 10127.000 | NaN | NaN | NaN | 2.455 | 1.106 | 0.000 | 2.000 | 2.000 | 3.000 | 6.000 |
| Credit_Limit | 10127.000 | NaN | NaN | NaN | 8631.954 | 9088.777 | 1438.300 | 2555.000 | 4549.000 | 11067.500 | 34516.000 |
| Total_Revolving_Bal | 10127.000 | NaN | NaN | NaN | 1162.814 | 814.987 | 0.000 | 359.000 | 1276.000 | 1784.000 | 2517.000 |
| Avg_Open_To_Buy | 10127.000 | NaN | NaN | NaN | 7469.140 | 9090.685 | 3.000 | 1324.500 | 3474.000 | 9859.000 | 34516.000 |
| Total_Amt_Chng_Q4_Q1 | 10127.000 | NaN | NaN | NaN | 0.760 | 0.219 | 0.000 | 0.631 | 0.736 | 0.859 | 3.397 |
| Total_Trans_Amt | 10127.000 | NaN | NaN | NaN | 4404.086 | 3397.129 | 510.000 | 2155.500 | 3899.000 | 4741.000 | 18484.000 |
| Total_Trans_Ct | 10127.000 | NaN | NaN | NaN | 64.859 | 23.473 | 10.000 | 45.000 | 67.000 | 81.000 | 139.000 |
| Total_Ct_Chng_Q4_Q1 | 10127.000 | NaN | NaN | NaN | 0.712 | 0.238 | 0.000 | 0.582 | 0.702 | 0.818 | 3.714 |
| Avg_Utilization_Ratio | 10127.000 | NaN | NaN | NaN | 0.275 | 0.276 | 0.000 | 0.023 | 0.176 | 0.503 | 0.999 |
data.describe()
| Customer_Age | Dependent_count | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 10127.000 | 10127.000 | 10127.000 | 10127.000 | 10127.000 | 10127.000 | 10127.000 | 10127.000 | 10127.000 | 10127.000 | 10127.000 | 10127.000 | 10127.000 | 10127.000 |
| mean | 46.326 | 2.346 | 35.928 | 3.813 | 2.341 | 2.455 | 8631.954 | 1162.814 | 7469.140 | 0.760 | 4404.086 | 64.859 | 0.712 | 0.275 |
| std | 8.017 | 1.299 | 7.986 | 1.554 | 1.011 | 1.106 | 9088.777 | 814.987 | 9090.685 | 0.219 | 3397.129 | 23.473 | 0.238 | 0.276 |
| min | 26.000 | 0.000 | 13.000 | 1.000 | 0.000 | 0.000 | 1438.300 | 0.000 | 3.000 | 0.000 | 510.000 | 10.000 | 0.000 | 0.000 |
| 25% | 41.000 | 1.000 | 31.000 | 3.000 | 2.000 | 2.000 | 2555.000 | 359.000 | 1324.500 | 0.631 | 2155.500 | 45.000 | 0.582 | 0.023 |
| 50% | 46.000 | 2.000 | 36.000 | 4.000 | 2.000 | 2.000 | 4549.000 | 1276.000 | 3474.000 | 0.736 | 3899.000 | 67.000 | 0.702 | 0.176 |
| 75% | 52.000 | 3.000 | 40.000 | 5.000 | 3.000 | 3.000 | 11067.500 | 1784.000 | 9859.000 | 0.859 | 4741.000 | 81.000 | 0.818 | 0.503 |
| max | 73.000 | 5.000 | 56.000 | 6.000 | 6.000 | 6.000 | 34516.000 | 2517.000 | 34516.000 | 3.397 | 18484.000 | 139.000 | 3.714 | 0.999 |
Insight
Categorical columns: Attrition_Flag, Gender, Education_Level, Marital_Status, Income_Category, Card_Category
Identifier column (already dropped): CLIENTNUM
Numerical columns: Customer_Age, Dependent_count, Months_on_book, Total_Relationship_Count, Months_Inactive_12_mon, Contacts_Count_12_mon, Credit_Limit, Total_Amt_Chng_Q4_Q1, Total_Trans_Amt, Total_Ct_Chng_Q4_Q1, Avg_Utilization_Ratio
Based on the business problem, domain knowledge, data summary, and initial hypotheses, here is what we aim to understand better through the EDA process:
Variability: There's substantial variation in customer behavior across age, credit limits, spending patterns, and relationship with the bank. This suggests we may need to segment customers for more targeted analysis.
Potential Risk: High average utilization ratios and a wide range of credit limits suggest some customers might be at risk of overspending or financial stress, which could contribute to churn.
Relationship Impact: The data doesn't directly reveal the impact of relationship length or product count on churn. Further analysis is needed to understand these relationships.
Outliers: Some outliers exist, especially in credit limits and transaction amounts. These should be investigated to determine if they are errors or represent specific customer segments.
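The outlier investigation mentioned above is typically done with the 1.5×IQR rule; a minimal sketch on made-up values (not the actual Credit_Limit column):

```python
import pandas as pd

# Made-up values with one obvious outlier
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])

# Flag points beyond 1.5 * IQR from the quartiles
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = s[(s < lower) | (s > upper)]
print(outliers.tolist())  # [100.0]
```

Applied to Credit_Limit or Total_Trans_Amt, flagged points should be inspected before treatment, since here they may represent genuine high-value customer segments rather than errors.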
Having established a foundational understanding of the data and the business problem, we'll now conduct univariate and bivariate analyses. This will help us uncover relationships within the data and guide us in accurately addressing missing values before proceeding to model development.
Questions:
How does the change in transaction count (Total_Ct_Chng_Q4_Q1) vary by the customer's account status (Attrition_Flag)?
How does inactivity (Months_Inactive_12_mon) vary by the customer's account status (Attrition_Flag)?
# function to plot a boxplot and a histogram along the same scale.
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
"""
Boxplot and histogram combined
data: dataframe
feature: dataframe column
figsize: size of figure (default (12,7))
kde: whether to show the density curve (default False)
bins: number of bins for histogram (default None)
"""
f2, (ax_box2, ax_hist2) = plt.subplots(
nrows=2, # Number of rows of the subplot grid= 2
sharex=True, # x-axis will be shared among all subplots
gridspec_kw={"height_ratios": (0.25, 0.75)},
figsize=figsize,
) # creating the 2 subplots
sns.boxplot(
data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
) # boxplot will be created and a triangle will indicate the mean value of the column
# For histogram
if bins:
    sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins)
else:
    sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)
ax_hist2.axvline(
data[feature].mean(), color="green", linestyle="--"
) # Add mean to the histogram
ax_hist2.axvline(
data[feature].median(), color="black", linestyle="-"
) # Add median to the histogram
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
"""
Barplot with percentage at the top
data: dataframe
feature: dataframe column
perc: whether to display percentages instead of count (default is False)
n: displays the top n category levels (default is None, i.e., display all levels)
"""
total = len(data[feature]) # length of the column
count = data[feature].nunique()
if n is None:
plt.figure(figsize=(count + 1, 5))
else:
plt.figure(figsize=(n + 1, 5))
plt.xticks(rotation=90, fontsize=15)
ax = sns.countplot(
data=data,
x=feature,
palette="Paired",
order=data[feature].value_counts().index[:n].sort_values(),
)
for p in ax.patches:
if perc == True:
label = "{:.1f}%".format(
100 * p.get_height() / total
) # percentage of each class of the category
else:
label = p.get_height() # count of each level of the category
x = p.get_x() + p.get_width() / 2 # width of the plot
y = p.get_height() # height of the plot
ax.annotate(
label,
(x, y),
ha="center",
va="center",
size=12,
xytext=(0, 5),
textcoords="offset points",
) # annotate the percentage
plt.show() # show the plot
# function to plot stacked bar chart
def stacked_barplot(data, predictor, target):
"""
Print the category counts and plot a stacked bar chart
data: dataframe
predictor: independent variable
target: target variable
"""
count = data[predictor].nunique()
sorter = data[target].value_counts().index[-1]
tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
by=sorter, ascending=False
)
print(tab1)
print("-" * 120)
tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
by=sorter, ascending=False
)
tab.plot(kind="bar", stacked=True, figsize=(count + 1, 5))
plt.legend(loc="upper left", bbox_to_anchor=(1, 1), frameon=False)
plt.show()
# Function to plot stacked bar chart with percentages
def stacked_barplot_perc(data, predictor, target):
"""
Print the category counts and plot a stacked bar chart with percentages.
data: DataFrame
predictor: Independent variable (column name)
target: Target variable (column name)
"""
count = data[predictor].nunique() # Number of unique categories in predictor
# Sorting for stacked bar chart
sorter = data[target].value_counts().index[-1]
tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(by=sorter, ascending=False)
print(tab1)
# Crosstab for percentage calculations
tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(by=sorter, ascending=False)
# Plotting
tab.plot(kind="bar", stacked=True, figsize=(count + 1, 5))
plt.xlabel(predictor, fontsize=12)
plt.ylabel('Percentage', fontsize=12)
plt.title(f'Proportion of {target} by {predictor}', fontsize=14) # Add a descriptive title
plt.legend(title=target, loc="upper left", bbox_to_anchor=(1, 1), frameon=False)
# Add percentage labels inside bars
for n, x in enumerate([*tab.index.values]): # Iterate over each bar (predictor category)
for (proportion, y_loc) in zip(tab.loc[x], tab.loc[x].cumsum()):
if proportion != 0:
plt.text(
x=n - 0.17,
y=(y_loc - proportion) + (proportion / 2), # Center the text within the bar segment
s=f'{proportion:.1%}', # Format as percentage with one decimal place
color="white",
fontsize=10,
)
plt.show()
# function to plot distributions w.r.t. target
def distribution_plot_wrt_target(data, predictor, target):
fig, axs = plt.subplots(2, 2, figsize=(12, 10))
target_uniq = data[target].unique()
axs[0, 0].set_title("Distribution of " + predictor + " for " + target + "=" + str(target_uniq[0]))
sns.histplot(
data=data[data[target] == target_uniq[0]],
x=predictor,
kde=True,
ax=axs[0, 0],
color="teal",
)
axs[0, 1].set_title("Distribution of " + predictor + " for " + target + "=" + str(target_uniq[1]))
sns.histplot(
data=data[data[target] == target_uniq[1]],
x=predictor,
kde=True,
ax=axs[0, 1],
color="orange",
)
axs[1, 0].set_title("Boxplot w.r.t target")
sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")
axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
sns.boxplot(
data=data,
x=target,
y=predictor,
ax=axs[1, 1],
showfliers=False,
palette="gist_rainbow",
)
plt.tight_layout()
plt.show()
# Review data
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 10127 entries, 0 to 10126 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Attrition_Flag 10127 non-null object 1 Customer_Age 10127 non-null int64 2 Gender 10127 non-null object 3 Dependent_count 10127 non-null int64 4 Education_Level 8608 non-null object 5 Marital_Status 9378 non-null object 6 Income_Category 10127 non-null object 7 Card_Category 10127 non-null object 8 Months_on_book 10127 non-null int64 9 Total_Relationship_Count 10127 non-null int64 10 Months_Inactive_12_mon 10127 non-null int64 11 Contacts_Count_12_mon 10127 non-null int64 12 Credit_Limit 10127 non-null float64 13 Total_Revolving_Bal 10127 non-null int64 14 Avg_Open_To_Buy 10127 non-null float64 15 Total_Amt_Chng_Q4_Q1 10127 non-null float64 16 Total_Trans_Amt 10127 non-null int64 17 Total_Trans_Ct 10127 non-null int64 18 Total_Ct_Chng_Q4_Q1 10127 non-null float64 19 Avg_Utilization_Ratio 10127 non-null float64 dtypes: float64(5), int64(9), object(6) memory usage: 1.5+ MB
All client IDs were accounted for and unique, so this column needed no further analysis and was dropped.
# Age
histogram_boxplot(data, 'Customer_Age')
# break down of age
labeled_barplot(data, 'Customer_Age', perc=True)
Insight
Roughly normal distribution across ages, with ages 26 and 65 showing small spikes relative to neighboring ages
Highest percentages at ages 44 and 49, with 4.9% each
Ages 66 to 73 each round to 0.0% and need further analysis
Outliers present on the upper end
# Boxplot for Dependent count
histogram_boxplot(data, 'Dependent_count')
# dependent count breakdown
labeled_barplot(data, 'Dependent_count', perc = True)
Insight
# Months on book
histogram_boxplot(data, 'Months_on_book')
# break down on months on book
labeled_barplot(data, 'Months_on_book', perc=True)
Insight
Further insight
## Observation on Total_Relationship_Count
histogram_boxplot(data, 'Total_Relationship_Count')
# breakdown by percentage
labeled_barplot(data, 'Total_Relationship_Count', perc=True)
Insight
Further insight
# Observation on Months_Inactive_12_mon
histogram_boxplot(data, 'Months_Inactive_12_mon')
# Break down of Months
labeled_barplot(data, 'Months_Inactive_12_mon', perc=True)
Insight
Further insight
# Observation Contacts_Count_12_mon
histogram_boxplot(data, 'Contacts_Count_12_mon')
# break down of contact
labeled_barplot(data, 'Contacts_Count_12_mon', perc=True)
Insight
# Observation on Credit_Limit
histogram_boxplot(data, 'Credit_Limit')
# Breakdown of credit limit for customers
bin_edges = [0, 5000, 10000, 15000, 20000, 25000, 30000, 35000]
bin_labels = ['0-5k', '5k-10k', '10k-15k', '15k-20k', '20k-25k', '25k-30k', '30k+']
# Create a New Column for Bins
data['Credit_Limit_Bin'] = pd.cut(data['Credit_Limit'], bins=bin_edges, labels=bin_labels, right=False)
# Calculate Percentage per Bin and Sort
binned_data = (
    data['Credit_Limit_Bin']
    .value_counts(normalize=True)
    .mul(100)
    .reset_index(name='Percentage')
    .rename(columns={'index': 'Credit_Limit_Bin'})
    .sort_values(by='Credit_Limit_Bin')
)
# Plotting the Barplot
sns.set(style="whitegrid")
plt.figure(figsize=(10, 6))
sns.barplot(x='Credit_Limit_Bin', y='Percentage', data=binned_data)
plt.title('Distribution of Credit Limits', fontsize=14)
plt.xlabel('Credit Limit Range', fontsize=12)
plt.ylabel('Percentage of Customers (%)', fontsize=12)
plt.show()
Insight
Further insight
# Observation on Total_Revolving_Bal
histogram_boxplot(data, 'Total_Revolving_Bal')
Insight
# Observation on Avg_Open_To_Buy
histogram_boxplot(data, 'Avg_Open_To_Buy')
Insight
Further analysis
# Pull up info
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 10127 entries, 0 to 10126 Data columns (total 21 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Attrition_Flag 10127 non-null object 1 Customer_Age 10127 non-null int64 2 Gender 10127 non-null object 3 Dependent_count 10127 non-null int64 4 Education_Level 8608 non-null object 5 Marital_Status 9378 non-null object 6 Income_Category 10127 non-null object 7 Card_Category 10127 non-null object 8 Months_on_book 10127 non-null int64 9 Total_Relationship_Count 10127 non-null int64 10 Months_Inactive_12_mon 10127 non-null int64 11 Contacts_Count_12_mon 10127 non-null int64 12 Credit_Limit 10127 non-null float64 13 Total_Revolving_Bal 10127 non-null int64 14 Avg_Open_To_Buy 10127 non-null float64 15 Total_Amt_Chng_Q4_Q1 10127 non-null float64 16 Total_Trans_Amt 10127 non-null int64 17 Total_Trans_Ct 10127 non-null int64 18 Total_Ct_Chng_Q4_Q1 10127 non-null float64 19 Avg_Utilization_Ratio 10127 non-null float64 20 Credit_Limit_Bin 10127 non-null category dtypes: category(1), float64(5), int64(9), object(6) memory usage: 1.6+ MB
# Observation on Total_Amt_Chng_Q4_Q1
histogram_boxplot(data, 'Total_Amt_Chng_Q4_Q1')
# Define Bin Edges
bin_edges = [0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
bin_labels = ['0-0.5', '0.5-1.0', '1.0-1.5', '1.5-2.0', '2.0-2.5', '2.5-3.0']
# Create a New Column for Bins
# Note: the 7 values above 3.0 (the column max is 3.397) fall outside these edges and become NaN
data['Total_Amt_Chng_Q4_Q1_Binned'] = pd.cut(data['Total_Amt_Chng_Q4_Q1'], bins=bin_edges, labels=bin_labels)
# Plot the Histogram with Bins
plt.figure(figsize=(12, 6))
sns.histplot(data=data, x='Total_Amt_Chng_Q4_Q1_Binned', kde=False) # Removed kde for clarity
plt.title('Distribution of Change in Transaction Amount (Q4 over Q1) - Binned', fontsize=14)
plt.xlabel('Change in Transaction Amount (Binned)', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.xticks(rotation=45)
plt.show()
Insight
# Observation on total transaction amount
histogram_boxplot(data, 'Total_Trans_Amt')
Insight
# Observation on Total_Trans_Ct
histogram_boxplot(data, 'Total_Trans_Ct')
Insight
# Observation on Total_Ct_Chng_Q4_Q1
histogram_boxplot(data, 'Total_Ct_Chng_Q4_Q1')
Insight
# Observation on Avg_Utilization_Ratio
histogram_boxplot(data, 'Avg_Utilization_Ratio')
Insight
# Pull up data
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 10127 entries, 0 to 10126 Data columns (total 22 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Attrition_Flag 10127 non-null object 1 Customer_Age 10127 non-null int64 2 Gender 10127 non-null object 3 Dependent_count 10127 non-null int64 4 Education_Level 8608 non-null object 5 Marital_Status 9378 non-null object 6 Income_Category 10127 non-null object 7 Card_Category 10127 non-null object 8 Months_on_book 10127 non-null int64 9 Total_Relationship_Count 10127 non-null int64 10 Months_Inactive_12_mon 10127 non-null int64 11 Contacts_Count_12_mon 10127 non-null int64 12 Credit_Limit 10127 non-null float64 13 Total_Revolving_Bal 10127 non-null int64 14 Avg_Open_To_Buy 10127 non-null float64 15 Total_Amt_Chng_Q4_Q1 10127 non-null float64 16 Total_Trans_Amt 10127 non-null int64 17 Total_Trans_Ct 10127 non-null int64 18 Total_Ct_Chng_Q4_Q1 10127 non-null float64 19 Avg_Utilization_Ratio 10127 non-null float64 20 Credit_Limit_Bin 10127 non-null category 21 Total_Amt_Chng_Q4_Q1_Binned 10120 non-null category dtypes: category(2), float64(5), int64(9), object(6) memory usage: 1.6+ MB
# Countplot for attrition Flag
labeled_barplot(data, 'Attrition_Flag', perc = True)
Insight
Further note
This will be the target variable and must account for the class imbalance when building models.
# Countplot for Gender
labeled_barplot(data, 'Gender', perc = True)
Insight
# Count plot for education level
labeled_barplot(data, 'Education_Level', perc = True)
data['Education_Level'].value_counts().sum()
8608
Insight
Graduates are the largest group at 38.9%, followed by High School (19.9%) and Uneducated (14.7%).
Doctorate and Post-Graduate have the lowest percentages.
# Count plot for Marital status
labeled_barplot(data, 'Marital_Status', perc =True)
# Non-missing value count
data['Marital_Status'].value_counts().sum()
9378
Insight
Attrition rates differ only slightly by marital status: Single (≈16.9%) and Divorced (≈16.2%) customers churn marginally more than Married (≈15.1%).
# income category values
data['Income_Category'].unique()
array(['$60K - $80K', 'Less than $40K', '$80K - $120K', '$40K - $60K',
'$120K +', 'abc'], dtype=object)
# Change 'abc' to 'Unknown' for income category
data['Income_Category'] = data['Income_Category'].replace('abc', 'Unknown')
It is unclear why 'abc' appears as a category in Income_Category. We cannot assume it means $100K - $120K simply because that range is missing, so we replace it with 'Unknown' to retain the rows. There is no reason to manipulate it further without confirmation.
# Income category
labeled_barplot(data, 'Income_Category', perc = True);
Insight
# Countplot on card category
labeled_barplot(data, 'Card_Category', perc = True)
Insight
Further analysis
For bivariate analysis in the Thera Bank customer dataset, we should analyze relationships between the target variable (Attrition_Flag) and various predictor variables to identify potential factors influencing customer churn.
Demographic Factors:
Relationship Factors:
Numerical Predictor Variables:
Credit Card Usage:
# Get only the numerical columns
numeric_columns = data.select_dtypes(include=['int64', 'float64'])
# Calculate the correlation matrix for the numerical columns
correlation_matrix = numeric_columns.corr()
# Create the heatmap
plt.figure(figsize=(15, 7))
sns.heatmap(correlation_matrix, annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.title('Correlation Heatmap for Numerical Columns')
plt.show()
Insight
# Gender & target
stacked_barplot_perc(data, 'Gender', 'Attrition_Flag')
Attrition_Flag Attrited Customer Existing Customer All Gender All 1627 8500 10127 F 930 4428 5358 M 697 4072 4769
Insight
Female customers churn at a slightly higher rate (930/5358, ≈17.4%) than male customers (697/4769, ≈14.6%).
# Education level & target
stacked_barplot_perc(data, 'Education_Level','Attrition_Flag')
Attrition_Flag Attrited Customer Existing Customer All Education_Level All 1371 7237 8608 Graduate 487 2641 3128 High School 306 1707 2013 Uneducated 237 1250 1487 College 154 859 1013 Doctorate 95 356 451 Post-Graduate 92 424 516
Insight
Attrition rates are broadly similar (roughly 15-16%) for Graduate, High School, Uneducated, and College customers.
However, Doctorate (95/451, ≈21%) and Post-Graduate (92/516, ≈18%) customers show the highest attrition rates, which runs counter to the hypothesis that higher education reduces churn.
# marital status & target
stacked_barplot_perc(data, 'Marital_Status', 'Attrition_Flag')
Attrition_Flag Attrited Customer Existing Customer All Marital_Status All 1498 7880 9378 Married 709 3978 4687 Single 668 3275 3943 Divorced 121 627 748
# Income category and target
stacked_barplot_perc(data, 'Income_Category', 'Attrition_Flag')
Attrition_Flag Attrited Customer Existing Customer All Income_Category All 1627 8500 10127 Less than $40K 612 2949 3561 $40K - $60K 271 1519 1790 $80K - $120K 242 1293 1535 $60K - $80K 189 1213 1402 Unknown 187 925 1112 $120K + 126 601 727
Insight
Attrition rates are fairly uniform across income bands (from ≈13.5% for $60K - $80K up to ≈17.3% for $120K +), suggesting no strong relationship between income and churn.
# Card category and target
stacked_barplot_perc(data, 'Card_Category', 'Attrition_Flag')
Attrition_Flag Attrited Customer Existing Customer All Card_Category All 1627 8500 10127 Blue 1519 7917 9436 Silver 82 473 555 Gold 21 95 116 Platinum 5 15 20
Insight
Blue cards dominate the portfolio (≈93% of customers). Platinum shows the highest attrition rate (5 of 20, 25%), but the sample is far too small to be reliable.
# Total_Relationship_Count & target
stacked_barplot_perc(data, 'Total_Relationship_Count', 'Attrition_Flag')
Attrition_Flag Attrited Customer Existing Customer All Total_Relationship_Count All 1627 8500 10127 3 400 1905 2305 2 346 897 1243 1 233 677 910 5 227 1664 1891 4 225 1687 1912 6 196 1670 1866
Insight
Customers holding only 1 or 2 products have the highest attrition rates (≈26% and ≈28%), while those with 4-6 products churn far less (≈10-12%), supporting the hypothesis that deeper product relationships improve retention.
# Months_Inactive_12_mon & target
stacked_barplot_perc(data, 'Months_Inactive_12_mon', 'Attrition_Flag')
Attrition_Flag Attrited Customer Existing Customer All Months_Inactive_12_mon All 1627 8500 10127 3 826 3020 3846 2 505 2777 3282 4 130 305 435 1 100 2133 2233 5 32 146 178 6 19 105 124 0 15 14 29
distribution_plot_wrt_target(data,'Months_Inactive_12_mon', 'Attrition_Flag')
Insight
The longer customers remain inactive, the higher the chance they close their accounts.
Customers inactive for only one month have the lowest attrition rate (~4.5%); the small 0-month group (29 customers) churns at over 50%.
# Contacts_Count_12_mon & target
stacked_barplot_perc(data, 'Contacts_Count_12_mon', 'Attrition_Flag')
| Contacts_Count_12_mon | Attrited Customer | Existing Customer | All |
|---|---|---|---|
| All | 1627 | 8500 | 10127 |
| 3 | 681 | 2699 | 3380 |
| 2 | 403 | 2824 | 3227 |
| 4 | 315 | 1077 | 1392 |
| 1 | 108 | 1391 | 1499 |
| 5 | 59 | 117 | 176 |
| 6 | 54 | 0 | 54 |
| 0 | 7 | 392 | 399 |
Insight
Attrition rises with the number of bank contacts in the past 12 months: customers with zero contacts have the lowest attrition rate (~1.8%), and the rate climbs steadily from one contact upward.
All 54 customers with six contacts churned (100% attrition), making frequent contact a critical churn signal worth investigating.
# Customer age & Target
distribution_plot_wrt_target(data, 'Customer_Age', 'Attrition_Flag')
# Customers Age & target
stacked_barplot_perc(data, 'Customer_Age', 'Attrition_Flag')
| Customer_Age | Attrited Customer | Existing Customer | All |
|---|---|---|---|
| All | 1627 | 8500 | 10127 |
| 43 | 85 | 388 | 473 |
| 48 | 85 | 387 | 472 |
| 44 | 84 | 416 | 500 |
| 46 | 82 | 408 | 490 |
| 45 | 79 | 407 | 486 |
| 49 | 79 | 416 | 495 |
| 47 | 76 | 403 | 479 |
| 41 | 76 | 303 | 379 |
| 50 | 71 | 381 | 452 |
| 54 | 69 | 238 | 307 |
| 40 | 64 | 297 | 361 |
| 42 | 62 | 364 | 426 |
| 53 | 59 | 328 | 387 |
| 52 | 58 | 318 | 376 |
| 51 | 58 | 340 | 398 |
| 55 | 51 | 228 | 279 |
| 39 | 48 | 285 | 333 |
| 38 | 47 | 256 | 303 |
| 56 | 43 | 219 | 262 |
| 59 | 40 | 117 | 157 |
| 37 | 37 | 223 | 260 |
| 57 | 33 | 190 | 223 |
| 58 | 24 | 133 | 157 |
| 36 | 24 | 197 | 221 |
| 35 | 21 | 163 | 184 |
| 33 | 20 | 107 | 127 |
| 34 | 19 | 127 | 146 |
| 32 | 17 | 89 | 106 |
| 61 | 17 | 76 | 93 |
| 62 | 17 | 76 | 93 |
| 30 | 15 | 55 | 70 |
| 31 | 13 | 78 | 91 |
| 60 | 13 | 114 | 127 |
| 65 | 9 | 92 | 101 |
| 63 | 8 | 57 | 65 |
| 29 | 7 | 49 | 56 |
| 26 | 6 | 72 | 78 |
| 64 | 5 | 38 | 43 |
| 27 | 3 | 29 | 32 |
| 28 | 1 | 28 | 29 |
| 66 | 1 | 1 | 2 |
| 68 | 1 | 1 | 2 |
| 67 | 0 | 4 | 4 |
| 70 | 0 | 1 | 1 |
| 73 | 0 | 1 | 1 |
Insight
Attrition counts peak for customers in their mid-40s, which is also where most of the customer base falls.
Ages 66 and 68 each have only two customers, one of whom churned, so their apparent 50% rate is not meaningful.
Ages 67, 70, and 73 have no attrited customers at all, but these samples are far too small to draw conclusions from.
# Dependent count & target
stacked_barplot_perc(data, 'Dependent_count', 'Attrition_Flag')
| Dependent_count | Attrited Customer | Existing Customer | All |
|---|---|---|---|
| All | 1627 | 8500 | 10127 |
| 3 | 482 | 2250 | 2732 |
| 2 | 417 | 2238 | 2655 |
| 1 | 269 | 1569 | 1838 |
| 4 | 260 | 1314 | 1574 |
| 0 | 135 | 769 | 904 |
| 5 | 64 | 360 | 424 |
Insight
Attrition varies little with the number of dependents, staying within roughly 15-18% across all groups.
# Months on Book & Target
distribution_plot_wrt_target(data, 'Months_on_book', 'Attrition_Flag')
# Months on book & target
stacked_barplot_perc(data, 'Months_on_book', 'Attrition_Flag')
| Months_on_book | Attrited Customer | Existing Customer | All |
|---|---|---|---|
| All | 1627 | 8500 | 10127 |
| 36 | 430 | 2033 | 2463 |
| 39 | 64 | 277 | 341 |
| 37 | 62 | 296 | 358 |
| 30 | 58 | 242 | 300 |
| 38 | 57 | 290 | 347 |
| 34 | 57 | 296 | 353 |
| 41 | 51 | 246 | 297 |
| 33 | 48 | 257 | 305 |
| 40 | 45 | 288 | 333 |
| 35 | 45 | 272 | 317 |
| 32 | 44 | 245 | 289 |
| 28 | 43 | 232 | 275 |
| 44 | 42 | 188 | 230 |
| 43 | 42 | 231 | 273 |
| 46 | 36 | 161 | 197 |
| 42 | 36 | 235 | 271 |
| 29 | 34 | 207 | 241 |
| 31 | 34 | 284 | 318 |
| 45 | 33 | 194 | 227 |
| 25 | 31 | 134 | 165 |
| 24 | 28 | 132 | 160 |
| 48 | 27 | 135 | 162 |
| 50 | 25 | 71 | 96 |
| 49 | 24 | 117 | 141 |
| 26 | 24 | 162 | 186 |
| 47 | 24 | 147 | 171 |
| 27 | 23 | 183 | 206 |
| 22 | 20 | 85 | 105 |
| 56 | 17 | 86 | 103 |
| 51 | 16 | 64 | 80 |
| 18 | 13 | 45 | 58 |
| 20 | 13 | 61 | 74 |
| 52 | 12 | 50 | 62 |
| 23 | 12 | 104 | 116 |
| 21 | 10 | 73 | 83 |
| 15 | 9 | 25 | 34 |
| 53 | 7 | 71 | 78 |
| 13 | 7 | 63 | 70 |
| 19 | 6 | 57 | 63 |
| 54 | 6 | 47 | 53 |
| 17 | 4 | 35 | 39 |
| 55 | 4 | 38 | 42 |
| 16 | 3 | 26 | 29 |
| 14 | 1 | 15 | 16 |
Insight
The months with the highest attrition counts cluster in the 30-40 range, with month 36 the highest by far (though this is also the most common tenure overall).
This could reflect a product reaching maturity around 30-36 months, such as a promotional rate, loan, or card term expiring.
This warrants further analysis.
# Credit Limit & target
distribution_plot_wrt_target(data, 'Credit_Limit', 'Attrition_Flag')
Insight
# Total revolving balance & target
distribution_plot_wrt_target(data, 'Total_Revolving_Bal', 'Attrition_Flag')
Insight
# Avg_Open_To_Buy & target
distribution_plot_wrt_target(data, 'Avg_Open_To_Buy', 'Attrition_Flag')
Insight
# Total_Amt_Chng_Q4_Q1 & target
distribution_plot_wrt_target(data, 'Total_Amt_Chng_Q4_Q1', 'Attrition_Flag')
Insight
# Total_Trans_Amt & target
distribution_plot_wrt_target(data, 'Total_Trans_Amt', 'Attrition_Flag')
Insight
The majority of existing customers have transaction amounts between \$2,000 and \$5,000, with a second peak in the distribution around \$10,000.
The existing-customer group contains more high spenders.
Outliers are present in both groups, and both distributions are right-skewed.
# Total_Ct_Chng_Q4_Q1 & target
distribution_plot_wrt_target(data, 'Total_Ct_Chng_Q4_Q1', 'Attrition_Flag')
Insight
# Avg_Utilization_Ratio & target
distribution_plot_wrt_target(data, 'Avg_Utilization_Ratio', 'Attrition_Flag')
Insight
Most customers in both groups use a relatively small portion of their available credit.
Many outliers are present among attrited customers.
How is the total transaction amount distributed?
What is the distribution of the level of education of customers?
How does the change in transaction count between Q4 and Q1 (Total_Ct_Chng_Q4_Q1) vary by the customer's account status (Attrition_Flag)?
How does the number of months a customer was inactive in the last 12 months (Months_Inactive_12_mon) vary by the customer's account status (Attrition_Flag)?
What are the attributes that have a strong correlation with each other?
# outlier detection using boxplot
numeric_columns = data.select_dtypes(include=np.number).columns.tolist()
plt.figure(figsize=(15, 12))
for i, variable in enumerate(numeric_columns):
    plt.subplot(4, 4, i + 1)
    plt.boxplot(data[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable)
plt.show()
Insight
df1 = data.copy()
# Drop Total_Amt_Chng_Q4_Q1_Binned column (it is not needed)
df1 = df1.drop(['Total_Amt_Chng_Q4_Q1_Binned'], axis=1)
# Drop Credit_Limit_Bin (it is not needed)
df1 = df1.drop(['Credit_Limit_Bin'], axis=1)
X = df1.drop(["Attrition_Flag"], axis=1)
y = df1["Attrition_Flag"]
# Splitting data into training, validation and test set:
# first we split data into 2 parts, say temporary and test
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)
# then we split the temporary set into train and validation
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=1, stratify=y_temp
)
print(X_train.shape, X_val.shape, X_test.shape)
(6075, 19) (2026, 19) (2026, 19)
print("Number of rows in train data =", X_train.shape[0])
print("Number of rows in validation data =", X_val.shape[0])
print("Number of rows in test data =", X_test.shape[0])
Number of rows in train data = 6075 Number of rows in validation data = 2026 Number of rows in test data = 2026
Reason why:
To build machine learning models that are more robust, generalizable, and ultimately, more useful for solving real-world problems.
Prevent overfitting
The model has not seen the validation and test data, so we can check how well it generalizes.
Avoid data leakage
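As a sanity check on the scheme above, `stratify` keeps the class ratio (roughly) identical across all three splits; a sketch on synthetic labels, not the bank data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# synthetic target with a 16% positive rate, roughly mirroring the churn share
y = np.array([1] * 160 + [0] * 840)
X = np.arange(1000).reshape(-1, 1)

# same 60/20/20 scheme as above: 20% held out for test,
# then 25% of the remainder used for validation
X_tmp, X_te, y_tmp, y_te = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_tmp, y_tmp, test_size=0.25, random_state=1, stratify=y_tmp
)

print(y_tr.mean(), y_va.mean(), y_te.mean())  # each ≈ 0.16
```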
df1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 20 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Attrition_Flag            10127 non-null  object
 1   Customer_Age              10127 non-null  int64
 2   Gender                    10127 non-null  object
 3   Dependent_count           10127 non-null  int64
 4   Education_Level           8608 non-null   object
 5   Marital_Status            9378 non-null   object
 6   Income_Category           10127 non-null  object
 7   Card_Category             10127 non-null  object
 8   Months_on_book            10127 non-null  int64
 9   Total_Relationship_Count  10127 non-null  int64
 10  Months_Inactive_12_mon    10127 non-null  int64
 11  Contacts_Count_12_mon     10127 non-null  int64
 12  Credit_Limit              10127 non-null  float64
 13  Total_Revolving_Bal       10127 non-null  int64
 14  Avg_Open_To_Buy           10127 non-null  float64
 15  Total_Amt_Chng_Q4_Q1      10127 non-null  float64
 16  Total_Trans_Amt           10127 non-null  int64
 17  Total_Trans_Ct            10127 non-null  int64
 18  Total_Ct_Chng_Q4_Q1       10127 non-null  float64
 19  Avg_Utilization_Ratio     10127 non-null  float64
dtypes: float64(5), int64(9), object(6)
memory usage: 1.5+ MB
# Define a dictionary to map class labels to 1 or 0
attrition_map = {'Attrited Customer': 1, 'Existing Customer': 0}
# Apply the mapping and convert to integer
y_train = y_train.map(attrition_map).astype(int)
y_val = y_val.map(attrition_map).astype(int)
y_test = y_test.map(attrition_map).astype(int)
# Define a dictionary to map 'M' to 1 and 'F' to 0
gender_map = {'M': 1, 'F': 0}
# Apply the mapping and convert to numeric
data['Gender'] = data['Gender'].map(gender_map).astype(int)
Note
Originally I left 'Attrited Customer' and 'Existing Customer' as strings, but ran into pos_label issues in the metric functions, so the target is mapped to 1/0 instead.
Reason for handling missing values after splitting the data
Data leakage prevention
Realistic evaluation
Consistent statistics
# Checking columns for missing values in train, validation or test sets
print(X_train.isna().sum())
print("-" * 30)
print(X_val.isna().sum())
print("-" * 30)
print(X_test.isna().sum())
Customer_Age 0 Gender 0 Dependent_count 0 Education_Level 897 Marital_Status 432 Income_Category 0 Card_Category 0 Months_on_book 0 Total_Relationship_Count 0 Months_Inactive_12_mon 0 Contacts_Count_12_mon 0 Credit_Limit 0 Total_Revolving_Bal 0 Avg_Open_To_Buy 0 Total_Amt_Chng_Q4_Q1 0 Total_Trans_Amt 0 Total_Trans_Ct 0 Total_Ct_Chng_Q4_Q1 0 Avg_Utilization_Ratio 0 dtype: int64 ------------------------------ Customer_Age 0 Gender 0 Dependent_count 0 Education_Level 308 Marital_Status 168 Income_Category 0 Card_Category 0 Months_on_book 0 Total_Relationship_Count 0 Months_Inactive_12_mon 0 Contacts_Count_12_mon 0 Credit_Limit 0 Total_Revolving_Bal 0 Avg_Open_To_Buy 0 Total_Amt_Chng_Q4_Q1 0 Total_Trans_Amt 0 Total_Trans_Ct 0 Total_Ct_Chng_Q4_Q1 0 Avg_Utilization_Ratio 0 dtype: int64 ------------------------------ Customer_Age 0 Gender 0 Dependent_count 0 Education_Level 314 Marital_Status 149 Income_Category 0 Card_Category 0 Months_on_book 0 Total_Relationship_Count 0 Months_Inactive_12_mon 0 Contacts_Count_12_mon 0 Credit_Limit 0 Total_Revolving_Bal 0 Avg_Open_To_Buy 0 Total_Amt_Chng_Q4_Q1 0 Total_Trans_Amt 0 Total_Trans_Ct 0 Total_Ct_Chng_Q4_Q1 0 Avg_Utilization_Ratio 0 dtype: int64
import pandas as pd
from sklearn.impute import SimpleImputer
# Get list of categorical and numerical columns
cat_cols = list(X_train.select_dtypes(include='object').columns)
num_cols = list(X_train.select_dtypes(include=['int', 'float']).columns)
# Impute categorical columns
cat_imputer = SimpleImputer(strategy='most_frequent')
X_train[cat_cols] = cat_imputer.fit_transform(X_train[cat_cols])
X_val[cat_cols] = cat_imputer.transform(X_val[cat_cols])
X_test[cat_cols] = cat_imputer.transform(X_test[cat_cols])
# Impute numerical columns
num_imputer = SimpleImputer(strategy='mean')
X_train[num_cols] = num_imputer.fit_transform(X_train[num_cols])
X_val[num_cols] = num_imputer.transform(X_val[num_cols])
X_test[num_cols] = num_imputer.transform(X_test[num_cols])
# Checking that no column has missing values in train, validation or test sets
print(X_train.isna().sum())
print("-" * 30)
print(X_val.isna().sum())
print("-" * 30)
print(X_test.isna().sum())
Customer_Age 0 Gender 0 Dependent_count 0 Education_Level 0 Marital_Status 0 Income_Category 0 Card_Category 0 Months_on_book 0 Total_Relationship_Count 0 Months_Inactive_12_mon 0 Contacts_Count_12_mon 0 Credit_Limit 0 Total_Revolving_Bal 0 Avg_Open_To_Buy 0 Total_Amt_Chng_Q4_Q1 0 Total_Trans_Amt 0 Total_Trans_Ct 0 Total_Ct_Chng_Q4_Q1 0 Avg_Utilization_Ratio 0 dtype: int64 ------------------------------ Customer_Age 0 Gender 0 Dependent_count 0 Education_Level 0 Marital_Status 0 Income_Category 0 Card_Category 0 Months_on_book 0 Total_Relationship_Count 0 Months_Inactive_12_mon 0 Contacts_Count_12_mon 0 Credit_Limit 0 Total_Revolving_Bal 0 Avg_Open_To_Buy 0 Total_Amt_Chng_Q4_Q1 0 Total_Trans_Amt 0 Total_Trans_Ct 0 Total_Ct_Chng_Q4_Q1 0 Avg_Utilization_Ratio 0 dtype: int64 ------------------------------ Customer_Age 0 Gender 0 Dependent_count 0 Education_Level 0 Marital_Status 0 Income_Category 0 Card_Category 0 Months_on_book 0 Total_Relationship_Count 0 Months_Inactive_12_mon 0 Contacts_Count_12_mon 0 Credit_Limit 0 Total_Revolving_Bal 0 Avg_Open_To_Buy 0 Total_Amt_Chng_Q4_Q1 0 Total_Trans_Amt 0 Total_Trans_Ct 0 Total_Ct_Chng_Q4_Q1 0 Avg_Utilization_Ratio 0 dtype: int64
All missing values have been treated.
cols = X_train.select_dtypes(include=["object", "category"])
for i in cols.columns:
    print(X_train[i].value_counts())
    print("*" * 30)
Gender F 3205 M 2870 Name: count, dtype: int64 ****************************** Education_Level Graduate 2763 High School 1212 Uneducated 928 College 598 Post-Graduate 299 Doctorate 275 Name: count, dtype: int64 ****************************** Marital_Status Married 3280 Single 2346 Divorced 449 Name: count, dtype: int64 ****************************** Income_Category Less than $40K 2130 $40K - $60K 1076 $80K - $120K 918 $60K - $80K 847 Unknown 668 $120K + 436 Name: count, dtype: int64 ****************************** Card_Category Blue 5668 Silver 327 Gold 71 Platinum 9 Name: count, dtype: int64 ******************************
cols = X_val.select_dtypes(include=["object", "category"])
for i in cols.columns:
    print(X_val[i].value_counts())
    print("*" * 30)
Gender F 1106 M 920 Name: count, dtype: int64 ****************************** Education_Level Graduate 935 High School 399 Uneducated 278 College 217 Post-Graduate 108 Doctorate 89 Name: count, dtype: int64 ****************************** Marital_Status Married 1057 Single 814 Divorced 155 Name: count, dtype: int64 ****************************** Income_Category Less than $40K 712 $40K - $60K 340 $80K - $120K 306 $60K - $80K 280 Unknown 248 $120K + 140 Name: count, dtype: int64 ****************************** Card_Category Blue 1893 Silver 108 Gold 19 Platinum 6 Name: count, dtype: int64 ******************************
cols = X_test.select_dtypes(include=["object", "category"])
for i in cols.columns:
    print(X_test[i].value_counts())
    print("*" * 30)
Gender F 1047 M 979 Name: count, dtype: int64 ****************************** Education_Level Graduate 949 High School 402 Uneducated 281 College 198 Post-Graduate 109 Doctorate 87 Name: count, dtype: int64 ****************************** Marital_Status Married 1099 Single 783 Divorced 144 Name: count, dtype: int64 ****************************** Income_Category Less than $40K 719 $40K - $60K 374 $80K - $120K 311 $60K - $80K 275 Unknown 196 $120K + 151 Name: count, dtype: int64 ****************************** Card_Category Blue 1875 Silver 120 Gold 26 Platinum 5 Name: count, dtype: int64 ******************************
X_train = pd.get_dummies(X_train, drop_first=True)
X_val = pd.get_dummies(X_val, drop_first=True)
X_test = pd.get_dummies(X_test, drop_first=True)
print(X_train.shape, X_val.shape, X_test.shape)
(6075, 30) (2026, 30) (2026, 30)
After encoding there are 30 columns.
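One caveat with calling `get_dummies` on each split separately: if a rare category (Platinum has only 5-9 rows per split here) happened to be absent from one split, that split would silently lose a column. A hedged guard is to reindex the validation and test frames to the training columns; a sketch on toy frames (not the notebook's data):

```python
import pandas as pd

# toy frames: 'Gold' and 'Silver' never appear in the validation split
train = pd.DataFrame({'Card': ['Blue', 'Silver', 'Gold']})
val = pd.DataFrame({'Card': ['Blue', 'Blue']})

train_enc = pd.get_dummies(train, drop_first=True)
val_enc = pd.get_dummies(val, drop_first=True)

# align validation columns to the training layout, filling missing dummies with 0
val_enc = val_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(val_enc.columns))  # ['Card_Gold', 'Card_Silver']
```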
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {"Accuracy": acc, "Recall": recall, "Precision": precision, "F1": f1},
        index=[0],
    )

    return df_perf
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)

    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
The nature of predictions made by the classification model will translate as follows:
Potential types of losses
A false negative (predicting a customer will stay when they actually churn) means losing the customer and the associated fee revenue. A false positive (predicting churn for a customer who would have stayed) means spending retention effort unnecessarily.
Which loss is greater?
The loss associated with false negatives (losing customers) far outweighs the loss associated with false positives (unnecessary retention efforts).
Primary Metric: Recall (Sensitivity)
Formula: Recall = TP / (TP + FN)
Business Justification: The bank's priority is to identify as many potential customer churn cases as possible (minimize false negatives). Missing a customer who is likely to leave is costly due to lost revenue and the potential expense of reacquiring them. Recall directly measures the model's ability to capture these at-risk customers. Also, it is more cost effective to retain existing customers than acquiring new customers.
Additional considerations:
F1 score. The F1 score is the harmonic mean of precision and recall. It's a good metric to consider when you want to balance both minimizing false negatives and false positives.
Precision-Recall Trade-off: While recall is crucial, banks also need to consider precision (the accuracy of positive predictions). A model with low precision might lead to unnecessary and costly retention efforts on customers who weren't actually going to churn. Therefore, finding the right balance between recall and precision is important.
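To make the formula and the trade-off concrete, a toy example (labels invented for illustration) with one missed churner and one wasted retention effort:

```python
from sklearn.metrics import precision_score, recall_score

# 1 = Attrited Customer (positive class), 0 = Existing Customer
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]  # one missed churner (FN), one false alarm (FP)

# TP = 3, FN = 1, FP = 1
print(recall_score(y_true, y_pred))     # 3 / (3 + 1) = 0.75
print(precision_score(y_true, y_pred))  # 3 / (3 + 1) = 0.75
```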
Reason for building initial models
Why these models?
Decision Tree (dtree), a good starting point due to its simplicity and interpretability.
Bagging (BaggingClassifier), since Decision Trees are prone to overfitting, Bagging can be a good way to improve their generalization performance.
Random Forest (RandomForestClassifier), often a top performer on a variety of tasks, including classification problems like customer churn.
GBM (GradientBoostingClassifier), given the potential complexity of the customer churn problem, GBM is a good candidate to see if it can outperform simpler models.
AdaBoost (AdaBoostClassifier), worth trying AdaBoost to see if it can provide a different performance profile compared to GBM.
Why class_weight balanced?
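With `class_weight='balanced'`, scikit-learn weights each class inversely to its frequency, so the minority Attrited class counts more during tree construction. A sketch of the resulting weights, using counts matching the training split shown later in the notebook (976 attrited, 5099 existing):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# counts taken from the notebook's training split
y = np.array([0] * 5099 + [1] * 976)
weights = compute_class_weight(class_weight='balanced', classes=np.array([0, 1]), y=y)

# balanced weight for class c = n_samples / (n_classes * count_c)
print(weights)  # ≈ [0.596, 3.112]
```

So each attrited sample carries roughly five times the weight of an existing-customer sample.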
models = [] # Empty list to store all the models
# Appending models into the list
models.append(("Bagging", BaggingClassifier(base_estimator=DecisionTreeClassifier(random_state=1, class_weight='balanced'), random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1, class_weight='balanced')))
models.append(("GBM", GradientBoostingClassifier(random_state=1)))
models.append(("Adaboost", AdaBoostClassifier(random_state=1)))
models.append(("dtree", DecisionTreeClassifier(random_state=1, class_weight='balanced')))
# Train and evaluate the models
print("\n" "Training Performance:" "\n")
for name, model in models:
    model.fit(X_train, y_train)
    scores_train = recall_score(y_train, model.predict(X_train))
    print("{}: {}".format(name, scores_train))
print("\n" "Validation Performance:" "\n")
for name, model in models:
    model.fit(X_train, y_train)
    scores_val = recall_score(y_val, model.predict(X_val))
    print("{}: {}".format(name, scores_val))
Training Performance: Bagging: 0.9846311475409836 Random forest: 1.0 GBM: 0.9108606557377049 Adaboost: 0.8504098360655737 dtree: 1.0 Validation Performance: Bagging: 0.7576687116564417 Random forest: 0.6625766871165644 GBM: 0.8190184049079755 Adaboost: 0.8128834355828221 dtree: 0.7300613496932515
Insight
Best models: GBM and AdaBoost
Reasoning: These two achieved the highest recall on the validation set (0.819 and 0.813, respectively), indicating the strongest ability to identify customers likely to churn; Bagging followed at 0.758.
Reason for this approach
SMOTE (Synthetic Minority Oversampling Technique) balances the classes by synthesizing new minority-class samples as interpolations between existing 'Attrited Customer' rows and their nearest neighbors, rather than duplicating rows or discarding majority-class data.
# Print counts before oversampling
print("Before Oversampling, counts of label 'Attrited Customer': {}".format(sum(y_train == 1)))
print("Before Oversampling, counts of label 'Existing Customer': {} \n".format(sum(y_train == 0)))
# Synthetic Minority Over Sampling Technique
sm = SMOTE(sampling_strategy=1, k_neighbors=5, random_state=1)
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)
# Print counts after oversampling
print("After Oversampling, counts of label 'Attrited Customer': {}".format(sum(y_train_over == 1)))
print("After Oversampling, counts of label 'Existing Customer': {} \n".format(sum(y_train_over == 0)))
print("After Oversampling, the shape of train_X: {}".format(X_train_over.shape))
print("After Oversampling, the shape of train_y: {} \n".format(y_train_over.shape))
Before Oversampling, counts of label 'Attrited Customer': 976 Before Oversampling, counts of label 'Existing Customer': 5099 After Oversampling, counts of label 'Attrited Customer': 5099 After Oversampling, counts of label 'Existing Customer': 5099 After Oversampling, the shape of train_X: (10198, 30) After Oversampling, the shape of train_y: (10198,)
models = [] # Empty list to store all the models
# Appending models into the list
models.append(("Bagging", BaggingClassifier(base_estimator=DecisionTreeClassifier(random_state=1, class_weight='balanced'), random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1, class_weight='balanced')))
models.append(("GBM", GradientBoostingClassifier(random_state=1)))
models.append(("Adaboost", AdaBoostClassifier(random_state=1)))
models.append(("dtree", DecisionTreeClassifier(random_state=1, class_weight='balanced')))
# Training Performance
print("\n" "Training Performance:" "\n")
for name, model in models:
    model.fit(X_train_over, y_train_over)
    scores = recall_score(y_train_over, model.predict(X_train_over))
    print("{}: {}".format(name, scores))
print("\n" "Validation Performance:" "\n")
for name, model in models:
    model.fit(X_train_over, y_train_over)
    scores_val = recall_score(y_val, model.predict(X_val))
    print("{}: {}".format(name, scores_val))
Training Performance: Bagging: 0.9972543636007061 Random forest: 1.0 GBM: 0.9847028829182193 Adaboost: 0.9731319866640518 dtree: 1.0 Validation Performance: Bagging: 0.8282208588957055 Random forest: 0.7791411042944786 GBM: 0.8680981595092024 Adaboost: 0.8466257668711656 dtree: 0.803680981595092
Insight
Best models: GBM and AdaBoost
Reasoning: These models again showed the best recall on the validation set (0.868 and 0.847, respectively) after oversampling, improving on their scores from the original imbalanced training data.
Reason for use?
The dataset is imbalanced with significantly more "Existing Customers" than "Attrited Customers". This is crucial because machine learning models tend to be biased towards the majority class in imbalanced datasets, leading to poor performance in identifying the minority class ("Attrited Customers").
# Print counts before undersampling
print("Before Undersampling, counts of label 'Attrited Customer': {}".format(sum(y_train == 1)))
print("Before Undersampling, counts of label 'Existing Customer': {} \n".format(sum(y_train == 0)))
# Random undersampler for under sampling the data
rus = RandomUnderSampler(random_state=1, sampling_strategy=1)
X_train_un, y_train_un = rus.fit_resample(X_train, y_train)
# Print counts after undersampling
print("After Undersampling, counts of label 'Attrited Customer': {}".format(sum(y_train_un == 1)))
print("After Undersampling, counts of label 'Existing Customer': {} \n".format(sum(y_train_un == 0)))
print("After Undersampling, the shape of train_X: {}".format(X_train_un.shape))
print("After Undersampling, the shape of train_y: {} \n".format(y_train_un.shape))
Before Undersampling, counts of label 'Attrited Customer': 976 Before Undersampling, counts of label 'Existing Customer': 5099 After Undersampling, counts of label 'Attrited Customer': 976 After Undersampling, counts of label 'Existing Customer': 976 After Undersampling, the shape of train_X: (1952, 30) After Undersampling, the shape of train_y: (1952,)
from sklearn.metrics import recall_score
models = [] # Empty list to store all the models
# Appending models into the list
models.append(("Bagging", BaggingClassifier(base_estimator=DecisionTreeClassifier(random_state=1, class_weight='balanced'), random_state=1)))
models.append(("Random forest", RandomForestClassifier(random_state=1, class_weight='balanced')))
models.append(("GBM", GradientBoostingClassifier(random_state=1)))
models.append(("Adaboost", AdaBoostClassifier(random_state=1)))
models.append(("dtree", DecisionTreeClassifier(random_state=1, class_weight='balanced')))
print("\n" "Training Performance:" "\n")
for name, model in models:
    model.fit(X_train_un, y_train_un)
    scores = recall_score(y_train_un, model.predict(X_train_un))
    print("{}: {}".format(name, scores))
print("\n" "Validation Performance:" "\n")
for name, model in models:
    model.fit(X_train_un, y_train_un)
    scores_val = recall_score(y_val, model.predict(X_val))
    print("{}: {}".format(name, scores_val))
Training Performance: Bagging: 0.9907786885245902 Random forest: 1.0 GBM: 0.9784836065573771 Adaboost: 0.951844262295082 dtree: 1.0 Validation Performance: Bagging: 0.8650306748466258 Random forest: 0.9079754601226994 GBM: 0.9294478527607362 Adaboost: 0.9294478527607362 dtree: 0.8650306748466258
Insight
Best models: GBM, AdaBoost, and Random Forest
Reasoning: These three achieved validation recall above 0.90 (0.929, 0.929, and 0.908, respectively), while Bagging and the decision tree reached 0.865. The perfect or near-perfect training recall still suggests some overfitting, but the strong validation recall indicates these models generalize well.
Oversampling vs. undersampling: Both techniques improved validation recall compared to the original models, indicating they were effective in addressing the class imbalance.
Overall, the models trained on the undersampled data generally outperformed the oversampled models, showing the highest validation recall scores and a smaller train-validation gap.
However, the very high training recall scores for the undersampled models still indicate some degree of overfitting, which needs further refinement.
Reason for use?
Hyperparameter tuning can help find the optimal configuration for each model (Bagging, Random Forest, GBM, AdaBoost, Decision Tree) to improve its ability to predict churn accurately.
It can also help reduce overfitting, which is evident in the current results: the models perform much better on the training data than on the validation data.
Reason for use?
AdaBoost is known for its sensitivity to class imbalances. By providing it with a balanced dataset through undersampling, we can enhance its ability to learn patterns from both the "Attrited" and "Existing" classes more effectively.
%%time
import sklearn.metrics as metrics
# defining model
Model = AdaBoostClassifier(random_state=1)
# Parameter grid to pass in RandomSearchCV
param_grid = {
    "n_estimators": np.arange(10, 40, 10),
    "learning_rate": [0.1, 0.01, 0.2, 0.05, 1],
    "base_estimator": [
        DecisionTreeClassifier(max_depth=1, random_state=1),
        DecisionTreeClassifier(max_depth=2, random_state=1),
        DecisionTreeClassifier(max_depth=3, random_state=1),
    ],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_jobs = -1, n_iter=50, scoring=scorer, cv=5, random_state=1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un, y_train_un)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'n_estimators': 30, 'learning_rate': 0.1, 'base_estimator': DecisionTreeClassifier(max_depth=3, random_state=1)} with CV score=0.9487650444793303:
CPU times: user 1.55 s, sys: 146 ms, total: 1.69 s
Wall time: 30.9 s
Insight into model's best parameters
n_estimators: 30 - This means the model uses 30 decision trees as base learners.
learning_rate: 0.1 - This controls how much each tree contributes to the overall model. A lower learning rate means each tree has a smaller impact, which can help prevent overfitting but might require more trees to achieve good performance.
base_estimator: DecisionTreeClassifier(max_depth=3, random_state=1) - This specifies that the base learners are decision trees with a maximum depth of 3. A smaller depth helps control the complexity of each tree, again reducing the risk of overfitting.
tuned_adb = AdaBoostClassifier(
    random_state=1,
    n_estimators=30,
    learning_rate=0.1,
    base_estimator=DecisionTreeClassifier(max_depth=3, random_state=1),
)
tuned_adb.fit(X_train_un, y_train_un)
AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=3, random_state=1), learning_rate=0.1, n_estimators=30, random_state=1)
# Checking model's performance on training set
adb_train = model_performance_classification_sklearn(tuned_adb, X_train_un, y_train_un)
adb_train
|   | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.953 | 0.964 | 0.943 | 0.953 |
Insight
The tuned model performs well on the training set, scoring around 0.95 on all four metrics.
This means it can accurately identify customers who have churned within the training set; validation performance will show how well this generalizes.
# Checking model's performance on validation set
adb_val = model_performance_classification_sklearn(tuned_adb, X_val, y_val)
adb_val
|   | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.927 | 0.933 | 0.709 | 0.805 |
Address overfitting
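Alongside retraining, one hedged lever (sketched here on synthetic data, not the notebook's) is tuning the probability threshold on the validation set: lowering it raises recall at the cost of precision, and raising it does the reverse, which speaks directly to the 0.709 validation precision above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import precision_score, recall_score

# synthetic imbalanced problem (~16% positives) standing in for the churn data
X, y = make_classification(n_samples=800, weights=[0.84], random_state=1)
clf = AdaBoostClassifier(random_state=1).fit(X, y)
proba = clf.predict_proba(X)[:, 1]

# sweep a few thresholds: recall falls (and precision generally rises) as it grows
for thr in (0.40, 0.50, 0.60):
    pred = (proba >= thr).astype(int)
    print(thr,
          round(recall_score(y, pred), 3),
          round(precision_score(y, pred, zero_division=0), 3))
```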
Reason for use?
Gradient Boosting builds trees sequentially, with each tree correcting the errors of the previous ones, and is a strong performer on tabular classification tasks.
Undersampling addresses class imbalance by discarding majority-class samples; in the churn context, Existing Customers far outnumber Attrited Customers.
%%time
# defining model
Model = GradientBoostingClassifier(random_state=1)
#Parameter grid to pass in RandomSearchCV
param_grid = {
    "init": [AdaBoostClassifier(random_state=1), DecisionTreeClassifier(random_state=1)],
    "n_estimators": np.arange(125, 175, 25),
    "learning_rate": [0.01, 0.2, 0.05, 1],
    "subsample": [0.8, 0.9, 1],
    "max_features": [0.5, 0.7, 1],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=50, scoring=scorer, cv=5, random_state=1, n_jobs = -1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un,y_train_un)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.9, 'n_estimators': 150, 'max_features': 0.5, 'learning_rate': 0.05, 'init': AdaBoostClassifier(random_state=1)} with CV score=0.9538932496075354:
CPU times: user 2.96 s, sys: 251 ms, total: 3.21 s
Wall time: 2min 8s
Insight into model's best parameters
subsample: 0.9 - each tree is fit on 90% of the training rows, adding randomness that reduces variance.
n_estimators: 150 - the ensemble uses 150 boosting stages.
max_features: 0.5 - each split considers half of the features, decorrelating the trees.
learning_rate: 0.05 - a small step size; each tree contributes modestly, which curbs overfitting.
init: AdaBoostClassifier - the initial predictions come from an AdaBoost model instead of the default constant estimate.
tuned_gbm1 = GradientBoostingClassifier(
    random_state=1,
    subsample=0.9,
    n_estimators=150,
    max_features=0.5,
    learning_rate=0.05,
    init=AdaBoostClassifier(random_state=1),
)
tuned_gbm1.fit(X_train_un, y_train_un)
GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1), learning_rate=0.05, max_features=0.5, n_estimators=150, random_state=1, subsample=0.9)
# Checking model's performance on training set
gbm1_train = model_performance_classification_sklearn(
    tuned_gbm1, X_train_un, y_train_un
)
gbm1_train
|   | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.966 | 0.975 | 0.958 | 0.966 |
Insight
# Checking model's performance on validation set
gbm1_val = model_performance_classification_sklearn(tuned_gbm1, X_val, y_val)
gbm1_val
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.942 | 0.914 | 0.770 | 0.836 |
Insight
Recall holds up on the validation set (0.914), but precision drops to 0.770, so the model raises more false churn alarms on unseen, imbalanced data than it did on the balanced training set.
Why train on oversampled data?
Oversampling keeps every majority-class example while balancing the classes, which could yield a more accurate, business-oriented churn model that prioritizes identifying at-risk customers.
It can improve recall, since the model sees far more minority-class (attrited) examples during training.
The larger, balanced training set gives the model more data to learn from.
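As a sketch of how a set like `X_train_over` / `y_train_over` can be built, the snippet below implements simple random oversampling of the minority class with `sklearn.utils.resample`. The helper name `random_oversample` and the tiny demo frame are illustrative, not from the notebook:

```python
import pandas as pd
from sklearn.utils import resample

def random_oversample(X, y, random_state=1):
    """Duplicate minority-class rows (with replacement) until both classes are balanced."""
    df = X.copy()
    df["_target"] = y.values
    counts = df["_target"].value_counts()
    minority = df[df["_target"] == counts.idxmin()]
    majority = df[df["_target"] == counts.idxmax()]
    minority_up = resample(
        minority,
        replace=True,                 # sample with replacement
        n_samples=len(majority),      # match the majority class size
        random_state=random_state,
    )
    balanced = pd.concat([majority, minority_up]).sample(frac=1, random_state=random_state)
    return balanced.drop(columns="_target"), balanced["_target"]

# Tiny imbalanced demo: 8 "stayed" vs 2 "attrited"
X_demo = pd.DataFrame({"f1": range(10)})
y_demo = pd.Series([0] * 8 + [1] * 2)
X_over, y_over = random_oversample(X_demo, y_demo)
print(y_over.value_counts().to_dict())  # both classes now have 8 rows
```

Libraries such as imbalanced-learn offer the same idea (plus SMOTE-style synthetic sampling) behind a ready-made API.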
%%time
#defining model
Model = GradientBoostingClassifier(random_state=1)
# Parameter grid to pass to RandomizedSearchCV
param_grid = {
"init": [AdaBoostClassifier(random_state=1),DecisionTreeClassifier(random_state=1)],
"n_estimators": np.arange(75,150,25),
"learning_rate": [0.1, 0.01, 0.2, 0.05, 1],
"subsample":[0.5,0.7,1],
"max_features":[0.5,0.7,1],
}
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)
#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=50, scoring=scorer, cv=5, random_state=1, n_jobs = -1)
#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_over, y_train_over)
print("Best parameters are {} with CV score={}:" .format(randomized_cv.best_params_,randomized_cv.best_score_))
Best parameters are {'subsample': 0.5, 'n_estimators': 100, 'max_features': 0.7, 'learning_rate': 0.01, 'init': AdaBoostClassifier(random_state=1)} with CV score=0.9541109122746253:
CPU times: user 6.86 s, sys: 659 ms, total: 7.52 s
Wall time: 6min 44s
Insights on the model's best parameters
subsample: 0.5 - Each tree sees only half of the training rows; this randomness helps prevent overfitting by reducing the model's variance.
n_estimators: 100 - A moderate number of boosting stages, in the middle of the searched range.
max_features: 0.7 - Subsampling features reduces correlation between the trees, further preventing overfitting and improving generalization.
learning_rate: 0.01 - A small learning rate, so each tree contributes only a small amount to the overall prediction; learning is slower but typically generalizes better.
tuned_gbm2 = GradientBoostingClassifier(
random_state=1,
subsample=0.5,
n_estimators=100,
max_features=0.7,
learning_rate=0.01,
init=AdaBoostClassifier(random_state=1),
)
tuned_gbm2.fit(X_train_over, y_train_over)
GradientBoostingClassifier(init=AdaBoostClassifier(random_state=1),
                           learning_rate=0.01, max_features=0.7, random_state=1,
                           subsample=0.5)
# Checking model's performance on training set
gbm2_train = model_performance_classification_sklearn(tuned_gbm2, X_train_over, y_train_over)
gbm2_train
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.934 | 0.963 | 0.909 | 0.935 |
Insight
Training recall remains high (0.963), though accuracy and precision are somewhat lower than for the undersampled model.
# Checking model's performance on validation set
gbm2_val = model_performance_classification_sklearn(tuned_gbm2, X_val, y_val)
gbm2_val
| | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| 0 | 0.914 | 0.899 | 0.674 | 0.770 |
Insight
Validation recall (0.899) is close to the undersampled model's, but precision falls to 0.674, so this model produces noticeably more false positives.
Why compare multiple models?
Different machine learning algorithms have different strengths and weaknesses, so no single approach is best by default.
Comparing candidates lets us optimize for the business goal of catching churners rather than a single generic metric.
Contrasting training and validation scores side by side helps detect and avoid overfitting.
A model that performs consistently across sampling strategies is more robust and reliable.
# training performance comparison
models_train_comp_df = pd.concat(
[
gbm1_train.T,
gbm2_train.T,
adb_train.T,
],
axis=1,
)
models_train_comp_df.columns = [
"Gradient boosting trained with Undersampled data",
"Gradient boosting trained with Oversampled data",
"AdaBoost trained with Undersampled data",
]
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
| | Gradient boosting trained with Undersampled data | Gradient boosting trained with Oversampled data | AdaBoost trained with Undersampled data |
|---|---|---|---|
| Accuracy | 0.966 | 0.934 | 0.953 |
| Recall | 0.975 | 0.963 | 0.964 |
| Precision | 0.958 | 0.909 | 0.943 |
| F1 | 0.966 | 0.935 | 0.953 |
# Validation performance comparison
models_val_comp_df = pd.concat(
    [gbm1_val.T, gbm2_val.T, adb_val.T], axis=1,
)
models_val_comp_df.columns = [
    "Gradient boosting trained with Undersampled data",
    "Gradient boosting trained with Oversampled data",
    "AdaBoost trained with Undersampled data",
]
print("Validation performance comparison:")
models_val_comp_df
Validation performance comparison:
| | Gradient boosting trained with Undersampled data | Gradient boosting trained with Oversampled data | AdaBoost trained with Undersampled data |
|---|---|---|---|
| Accuracy | 0.942 | 0.914 | 0.927 |
| Recall | 0.914 | 0.899 | 0.933 |
| Precision | 0.770 | 0.674 | 0.709 |
| F1 | 0.836 | 0.770 | 0.805 |
Insight
If the priority is to maximize the identification of potential churners (even at the cost of some false positives), then AdaBoost might be the preferred choice.
If a balance between identifying churners and minimizing incorrect predictions is desired, then the Gradient Boosting model with undersampled data could be a better option due to its higher F1 score.
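One way to make this trade-off concrete is to translate recall and precision into an expected campaign cost. The sketch below uses the validation metrics from the comparison table above, but the cost figures (`cost_fn`, `cost_fp`) and the churner count (`n_churners=100`) are purely hypothetical placeholders, so the dollar amounts are illustrative only:

```python
def expected_cost(recall, precision, n_churners, cost_fn=500.0, cost_fp=25.0):
    """Estimate retention-campaign cost from recall/precision on a validation set.

    cost_fn: assumed cost of missing a churner (lost customer)
    cost_fp: assumed cost of contacting a loyal customer with a retention offer
    """
    tp = recall * n_churners            # churners correctly flagged
    fn = n_churners - tp                # churners missed
    fp = tp / precision - tp if precision > 0 else 0.0  # loyal customers flagged
    return fn * cost_fn + fp * cost_fp

# Validation (recall, precision) pairs from the table above
models = {
    "GBM (undersampled)": (0.914, 0.770),
    "GBM (oversampled)": (0.899, 0.674),
    "AdaBoost (undersampled)": (0.933, 0.709),
}
for name, (rec, prec) in models.items():
    # n_churners=100 is a placeholder, not the actual validation count
    print(f"{name}: ~${expected_cost(rec, prec, n_churners=100):,.0f}")
```

With real cost estimates from the bank, this turns model selection into a direct business decision instead of a judgment call between recall and F1.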
Final thought
# Checking model's performance on test set
tuned_adb_test = model_performance_classification_sklearn(tuned_adb, X_test, y_test)
tuned_adb_test
| Accuracy | Recall | Precision | F1 | |
|---|---|---|---|---|
| 0 | 0.928 | 0.938 | 0.708 | 0.807 |
Insight
Recall: 93.8% of the actual "Attrited Customers" in the test set were correctly identified.
F1 Score: The F1 score of 0.807 provides a balanced measure of the model's precision and recall.
Overall, the final model performs strongly, particularly in its high recall. This is valuable for Thera Bank because it allows the bank to proactively identify and target the customers most likely to churn. However, the lower precision (0.708) means the model also flags some customers as churn risks who might not actually leave, so retention outreach will include some loyal customers.
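Since recall is strong but precision lags, one additional lever (shown here on synthetic data, not on the notebook's `tuned_adb`) is the decision threshold applied to `predict_proba`: lowering it trades precision for recall, and raising it does the reverse. A self-contained sketch:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (~16% positives, mirroring the attrition rate)
X, y = make_classification(n_samples=1500, weights=[0.84, 0.16], random_state=1)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=1)

clf = GradientBoostingClassifier(random_state=1).fit(X_tr, y_tr)
proba = clf.predict_proba(X_val)[:, 1]  # churn probability per customer

results = {}
for thr in (0.3, 0.5, 0.7):
    pred = (proba >= thr).astype(int)   # classify as churner above the threshold
    results[thr] = (
        recall_score(y_val, pred),
        precision_score(y_val, pred, zero_division=0),
    )
    print(f"threshold={thr}: recall={results[thr][0]:.3f}, precision={results[thr][1]:.3f}")
```

The right threshold depends on the relative cost of missing a churner versus contacting a loyal customer; sweeping it on the validation set would let the bank pick the operating point deliberately.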
Further actions
feature_names = X_train.columns
importances = tuned_adb.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
Insight
Top features driving churn predictions
Transaction-related features dominate. A customer's overall transaction behavior, including the number and amount of transactions and the outstanding revolving balance, plays a crucial role in predicting their likelihood to churn.
Least important features
Surprisingly, the categorical features showed the least importance, indicating that these individual categories have little direct impact on churn prediction.
It suggests that the bank should focus on understanding and managing customers' transaction behavior, relationship depth, and recent changes in spending patterns to effectively reduce churn.
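One caveat: the weight-based importances from boosted trees can favor continuous, high-cardinality features, which may partly explain why the categorical features rank so low. Permutation importance on held-out data is a common cross-check. A sketch on synthetic data (feature names are hypothetical, not the bank's columns):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 6 features, only 3 carry signal
X, y = make_classification(n_samples=800, n_features=6, n_informative=3, random_state=1)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=1)
clf = AdaBoostClassifier(random_state=1).fit(X_tr, y_tr)

# Permutation importance: drop in recall when each feature is shuffled on held-out data
result = permutation_importance(
    clf, X_val, y_val, scoring="recall", n_repeats=10, random_state=1
)
for i in np.argsort(result.importances_mean)[::-1]:
    print(f"feature_{i}: {result.importances_mean[i]:.3f} +/- {result.importances_std[i]:.3f}")
```

If the permutation ranking broadly agrees with `feature_importances_`, the business conclusions above stand on firmer ground.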
Based on the final model analysis, we can derive the following business insights and recommendations for Thera Bank.
Transaction behavior is critical
Changes in spending matters
Relationships matter
Demographics less influential
Proactive engagement
Monitoring spending patterns
Rewarding loyalty
Cross-selling opportunities
Data collection and analysis
Model refinement
The analysis reveals that transaction behavior and engagement metrics (like total transaction count and changes in spending) are more significant predictors of churn than demographic factors alone.
Hypotheses that were partially supported:
Hypothesis 4: Customers with longer relationships with the bank are less likely to churn. This might be partially true, as "Months_on_book" wasn't a top predictor, but it could still interact with other factors.
Hypothesis 5: Customers with more products at the bank are more likely to stay due to higher engagement. This has some support, as "Total_Relationship_Count" was identified as a moderately important feature.
Hypothesis 6: Frequent contact with the bank (e.g., customer service) might reduce churn. This wasn't strongly supported in the feature importance analysis.
# Pull up the dataframe's structure
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 20 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Attrition_Flag            10127 non-null  object
 1   Customer_Age              10127 non-null  int64
 2   Gender                    10127 non-null  object
 3   Dependent_count           10127 non-null  int64
 4   Education_Level           8608 non-null   object
 5   Marital_Status            9378 non-null   object
 6   Income_Category           10127 non-null  object
 7   Card_Category             10127 non-null  object
 8   Months_on_book            10127 non-null  int64
 9   Total_Relationship_Count  10127 non-null  int64
 10  Months_Inactive_12_mon    10127 non-null  int64
 11  Contacts_Count_12_mon     10127 non-null  int64
 12  Credit_Limit              10127 non-null  float64
 13  Total_Revolving_Bal       10127 non-null  int64
 14  Avg_Open_To_Buy           10127 non-null  float64
 15  Total_Amt_Chng_Q4_Q1      10127 non-null  float64
 16  Total_Trans_Amt           10127 non-null  int64
 17  Total_Trans_Ct            10127 non-null  int64
 18  Total_Ct_Chng_Q4_Q1       10127 non-null  float64
 19  Avg_Utilization_Ratio     10127 non-null  float64
dtypes: float64(5), int64(9), object(6)
memory usage: 1.5+ MB
sns.pairplot(data=df, diag_kind="kde")
plt.show()
Insight
# Boxplot age & Education
plt.figure(figsize=(12, 6))
sns.boxplot(x='Education_Level', y='Customer_Age', data=df)
plt.xticks(rotation=45)
plt.show()
Insight
# Boxplot age & Card category
plt.figure(figsize=(12, 6))
sns.boxplot(x='Customer_Age', y='Card_Category', data=df)
plt.xticks(rotation=45)
plt.show()
Insight
# Total_Trans_Amt & Education
plt.figure(figsize=(12, 6))
sns.boxplot(x='Education_Level', y='Total_Trans_Amt', data=df)
plt.xticks(rotation=45)
plt.show()
Insight
# Marital Status & Total_Trans_Amt
plt.figure(figsize=(12, 6))
sns.boxplot(x='Marital_Status', y='Total_Trans_Amt', data=df)
plt.xticks(rotation=45)
plt.show()
Insight
# Age & Total_Ct_Chng_Q4_Q1
plt.figure(figsize=(12, 6))
sns.boxplot(x='Customer_Age', y='Total_Ct_Chng_Q4_Q1', data=df)
plt.xticks(rotation=45)
plt.show()
Insight
# Age & Total_Trans_Amt
plt.figure(figsize=(12, 6))
sns.boxplot(x='Customer_Age', y='Total_Trans_Amt', data=df)
plt.xticks(rotation=45)
plt.show()
Insight
# Convert ipynb to HTML
!jupyter nbconvert --to html /content/AML_Project_LearnerNotebook_FullCode_Final.ipynb
[NbConvertApp] WARNING | pattern '/content/AML_Project_LearnerNotebook_FullCode_Final.ipynb' matched no files